fix: column aware row encoding: improve the implementation and add bench #17818
Conversation
src/common/src/util/value_encoding/column_aware_row_encoding.rs
The sorted implementation is still 10% faster than this hash-table implementation.
Considering the effort of refactoring the code and the additional logic needed for compatibility, perhaps we can postpone it until necessary.
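To make the trade-off concrete, here is a minimal sketch (not the PR's actual code; the function names are made up) of the two lookup strategies being compared: a slice sorted by column id queried with binary search, versus a hash table keyed by column id.

```rust
use std::collections::HashMap;

/// Sorted-slice strategy: requires `sorted` to be sorted by column id.
fn lookup_sorted(sorted: &[(i32, usize)], column_id: i32) -> Option<usize> {
    sorted
        .binary_search_by_key(&column_id, |(id, _)| *id)
        .ok()
        .map(|pos| sorted[pos].1)
}

/// Hash-table strategy: std HashMap stands in for ahash::HashMap here.
fn lookup_hashed(map: &HashMap<i32, usize>, column_id: i32) -> Option<usize> {
    map.get(&column_id).copied()
}

fn main() {
    let sorted = vec![(0, 3), (1, 0), (2, 1), (3, 2)];
    let map: HashMap<i32, usize> = sorted.iter().copied().collect();
    // Both strategies must agree for present and absent ids.
    for id in -1..5 {
        assert_eq!(lookup_sorted(&sorted, id), lookup_hashed(&map, id));
    }
    println!("both strategies agree");
}
```

The sorted variant avoids hashing entirely, which is why it can win on raw lookups, but it requires the persisted ids to be kept in sorted order, which is the compatibility concern discussed below.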
src/common/src/util/value_encoding/column_aware_row_encoding.rs
@@ -176,7 +168,7 @@ impl ValueRowSerializer for Serializer {
 /// Should non-null default values be specified, a new field could be added to Deserializer
 #[derive(Clone)]
 pub struct Deserializer {
-    needed_column_ids: BTreeMap<i32, usize>,
+    required_column_ids: HashMap<i32, usize>,
Interesting. I thought `BTreeMap` is faster when there are only a few entries: https://arc.net/l/quote/okdycbqi
The current value of `B` is 6. According to this, can we also benchmark the case for a table containing fewer than 6 columns?
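The intuition behind the question, sketched below (illustrative only, with std `HashMap` standing in for `ahash::HashMap`): with fewer entries than a B-tree node can hold, the whole `BTreeMap` fits in one node and a lookup is just a short scan of that node, so it can compete with hashing. The sketch only demonstrates that both maps behave identically on a small column-id set; the actual timing question is what the added benchmark answers.

```rust
use std::collections::{BTreeMap, HashMap};

fn main() {
    // 4 columns, like the new `column_aware_row_encoding_4_columns` bench:
    // column id -> field index.
    let ids = [(1, 0usize), (2, 1), (3, 2), (0, 3)];
    let btree: BTreeMap<i32, usize> = ids.iter().copied().collect();
    let hash: HashMap<i32, usize> = ids.iter().copied().collect();
    // Same lookup results from either structure; only performance differs.
    for id in 0..4 {
        assert_eq!(btree.get(&id), hash.get(&id));
    }
    println!("maps agree on all {} column ids", ids.len());
}
```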
Added bench `column_aware_row_encoding_4_columns`. Observed a similar performance improvement.
Benchmarking column_aware_row_encoding_16_columns_encode: Collecting 100 samples in estimated 5.0015
column_aware_row_encoding_16_columns_encode
time: [491.00 ns 492.30 ns 493.95 ns]
change: [-3.6361% -2.8341% -2.0340%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
Benchmarking column_aware_row_encoding_16_columns_decode: Collecting 100 samples in estimated 5.0011
column_aware_row_encoding_16_columns_decode
time: [445.62 ns 448.76 ns 451.61 ns]
change: [-51.073% -50.248% -49.515%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
Benchmarking column_aware_row_encoding_4_columns_encode: Collecting 100 samples in estimated 5.0009 s (23M iter
column_aware_row_encoding_4_columns_encode
time: [221.45 ns 224.51 ns 228.98 ns]
change: [+0.1302% +0.7332% +1.4856%] (p = 0.03 < 0.05)
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) high mild
5 (5.00%) high severe
Benchmarking column_aware_row_encoding_4_columns_decode: Collecting 100 samples in estimated 5.0004 s (38M iter
column_aware_row_encoding_4_columns_decode
time: [131.47 ns 131.82 ns 132.26 ns]
change: [-46.562% -46.074% -45.656%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
5 (5.00%) high mild
5 (5.00%) high severe
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Resolves #16763
Unfortunately, currently we don't ensure the persisted column IDs to be sorted. For example,
`create table t (a varchar, b int, c int)`
will create a state table that has column ids `1, 2, 3, 0`, where `0` always refers to `_row_id`. As a result, I reverted the order-based implementation and use a hash table now.
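A minimal sketch of that idea (names are hypothetical, not the PR's actual API): since the persisted ids are not guaranteed to be sorted, the deserializer keeps a hash map from column id to output position instead of relying on order.

```rust
use std::collections::HashMap;

/// Hypothetical deserializer: maps persisted column ids to output slots.
struct Deserializer {
    required_column_ids: HashMap<i32, usize>,
}

impl Deserializer {
    fn new(output_column_ids: &[i32]) -> Self {
        Self {
            required_column_ids: output_column_ids
                .iter()
                .enumerate()
                .map(|(idx, id)| (*id, idx))
                .collect(),
        }
    }

    /// Map a column id found in the encoded row to its output slot,
    /// or `None` if the caller does not need that column.
    fn output_index(&self, persisted_id: i32) -> Option<usize> {
        self.required_column_ids.get(&persisted_id).copied()
    }
}

fn main() {
    let de = Deserializer::new(&[0, 1, 2, 3]);
    // Persisted order `1, 2, 3, 0` still maps each id correctly.
    for id in [1, 2, 3, 0] {
        assert_eq!(de.output_index(id), Some(id as usize));
    }
}
```

This works for any persisted order at the cost of one hash lookup per column, which is where the choice of hasher matters.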
Benchmark result (on my MacBook)
Why `ahash::HashMap`?
According to my benchmark, `ahash::HashMap` is the fastest implementation.
Checklist
`./risedev check` (or alias, `./risedev c`)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.